Abstract

Previous work on predicting the target of visual search
from human fixations only considered closed-world settings
in which training labels are available and predictions are
performed for a known set of potential targets. In this work
we go beyond the state of the art by studying search target
prediction in an open-world setting in which we no longer
assume that we have fixation data to train for the search
targets. We present a dataset containing fixation data of
18 users searching for natural images from three image
categories within synthesised image collages of about 80
images. In a closed-world baseline experiment we show that
we can predict the correct target image out of a candidate
set of five images. We then present a new problem formulation
for search target prediction in the open-world setting
that is based on learning compatibilities between fixations
and potential targets.


1. Introduction 

In his seminal work from 1967, Yarbus showed that visual
behaviour is closely linked to task when looking at a
visual scene [39]. This work is an important demonstration
of task influence on fixation patterns and sparked a large
number of follow-up works in a range of disciplines, including
human vision, neuroscience, artificial intelligence, and
computer vision. A common goal in these human and computer
vision works is to analyse visual behaviour, i.e. typically
fixations and saccades, in order to make predictions
about user behaviour. For example, previous work has used
visual behaviour analysis as a means to predict the users’
tasks [1, 3, 13, 17, 22, 41, 42], visual activities [6, 7, 8, 26,
33], cognitive processes such as memory recall or high cognitive
load [5, 38], abstract thought processes [12, 28], the
type of a visual stimulus [4, 10, 23], interest for interactive
image retrieval [11, 15, 20, 25, 32, 37, 43], which number a
person has in mind [27], or - most recently - to predict the
search target during visual search [2, 16, 34, 41].

Predicting the target of a visual search task is particularly
interesting, as the corresponding internal representation, the
mental image of the search target, is difficult if not impossible
to assess using other modalities. While [41] and [2]
underlined the significant potential of using gaze information
to predict visual search targets, they both considered a
closed-world setting. In this setting, all potential search
targets are part of the training set, and fixations for all of these
targets were observed.

Figure 1: Experiments conducted in this work. In the closed-world
experiment we aim to predict which target image
(here $Q_2$) out of a candidate set of five images
$\mathcal{Q}_{train} = \mathcal{Q}_{test}$ the user is searching for by
analysing fixations $F_i$ on an image collage $C$. In the open-world
experiments we aim to predict $Q_i$ on the whole $\mathcal{Q}_{test}$.

In contrast, in this work we study an open-world setting
in which we no longer assume that we have fixation data
to train for these targets. Search target prediction in this
setting has significant practical relevance for a range of
applications, such as image and media retrieval. This setting is
challenging because we have to develop a learning mechanism
that can predict over an unknown set of targets. We study
this problem on a new dataset that contains fixation data of
18 users searching for five target images from three categories
(faces as well as two different sets of book covers) in
collages synthesised from about 80 images. The dataset is
publicly available online.

The contributions of this work are threefold. First, we
present an annotated dataset of human fixations on synthesised
collages of natural images during visual search that
lends itself to studying our new open-world setting. Compared
to previous works, our dataset is more challenging because
of its larger number of distractors, higher similarities
between search image and distractors, and a larger number
of potential search targets. Second, we introduce a novel
problem formulation and method for learning the compatibility
between observed fixations and potential search targets.
Third, using this dataset, we report on a series of experiments
on predicting users’ search target from fixations
by moving from closed-world to open-world settings.

2. Related Work 

Our work is related to previous works on analysing gaze
information in order to make predictions about general user
behaviour as well as on predicting search targets from fixations
during visual search tasks.

2.1. Predicting User Behaviour From Gaze 

Several researchers recently aimed to reproduce Yarbus’s
findings and to extend them by automatically predicting the
observers’ tasks. Green et al. reproduced the original experiments,
but although they were able to predict the observers’
identity and the observed images from the scanpaths,
they did not succeed in predicting the task itself [14].
Borji et al., Kanan et al., and Haji-Abolhassani et al. conducted
follow-up experiments using more sophisticated features
and machine learning techniques [1, 17, 22]. All three
works showed that the observers’ tasks could be successfully
predicted from gaze information alone.

Other works investigated means to recognise more general
aspects of user behaviour. Bulling et al. investigated
the recognition of everyday office activities from visual
behaviour, such as reading, taking hand-written notes, or
browsing the web [7]. Based on long-term eye movement
recordings, they later showed that high-level contextual
cues, such as social interactions or being mentally active,
could also be inferred from visual behaviour [8]. They further
showed that cognitive processes, such as visual memory
recall or cognitive load, could be inferred from gaze
information as well [5, 38]. The former finding
was recently confirmed by Henderson et al. [19].

Several previous works investigated the use of gaze information
as an implicit measure of relevance in image retrieval
tasks. For example, Oyekoya and Stentiford compared
similarity measures based on a visual saliency model
as well as real human gaze patterns, indicating better performance
for gaze [30]. In later works the same and other
authors showed that gaze information yielded significantly
better performance than random selection or using saliency
information [31, 36]. Coddington presented a similar system
but used two separate screens for the task [11] while
Kozma et al. focused on implicit cues obtained from gaze
in real-time interfaces [25]. With the goal of making implicit
relevance feedback richer, Klami proposed to infer
which parts of the image the user found most relevant from
gaze [24].

2.2. Predicting Search Targets From Gaze 

Only a few previous works focused on visual search
and the problem of predicting search targets from gaze.
Zelinsky et al. aimed to predict subjects’ gaze patterns during
categorical search tasks [40]. They designed a series
of experiments in which participants had to find two categorical
search targets (teddy bear and butterfly) among four
visually similar distractors. They predicted the number of
fixations made prior to search judgements as well as the percentage
of first eye movements landing on the search target.
In another work they showed how to predict the categorical
search targets themselves from eye fixations [41]. Borji et al.
focused on predicting search targets from fixations [2]. In
three experiments, participants had to find a binary pattern
and 3-level luminance patterns out of a set of other patterns,
as well as one of 15 objects in 11 synthetic natural scenes.
They showed that binary patterns with higher similarity to
the search target were viewed more often by participants.
Additionally, they found that when the complexity of the
search target increased, participants were guided more by
sub-patterns rather than the whole pattern.

2.3. Summary 

The works of Zelinsky et al. [41] and Borji et al. [2] are
most related to ours. However, both works only considered
simplified visual stimuli or synthesised natural scenes in a
closed-world setting. In that setting, all potential search targets
were part of the training set and fixations for all of these
targets were observed. In contrast, our work is the first to
address the open-world setting in which we no longer assume
that we have fixation data to train for these targets, and
to present a new problem formulation for search target
recognition in this setting.

3. Data Collection and Collage Synthesis 

Given the lack of an appropriate dataset, we designed a
human study to collect fixation data during visual search. In
contrast to previous works that used 3x3 squared patterns
at two or three luminance levels, or synthesised images of
natural scenes [2], our goal was to collect fixations on collages
of natural images. We therefore opted for a task that
involved searching for a single image (the target) within a
synthesised collage of images (the search set). Each
collage is a random permutation of a finite set of images.
To explore the impact of the similarity in appearance
between target and search set on both fixation behaviour and
automatic inference, we have created three different search
tasks covering a range of similarities.

Figure 2: Sample image collages used for data collection:
O’Reilly book covers (top), Amazon book covers (middle),
mugshots (bottom, blurred for privacy reasons). Participants
were asked to find different targets within random
permutations of these collages.

In prior work, colour was found to be a particularly important
cue for guiding search to targets and target-similar
objects [21, 29]. Therefore, for the first task we selected
78 coloured O’Reilly book covers to compose the collages.
These covers show a woodcut of an animal at the top and the
title of the book in a characteristic font underneath (see
Figure 2 top). Given that overall cover appearance was very
similar, this task allows us to analyse fixation behaviour
when colour is the most discriminative feature.

For the second task we use a set of 84 book covers from 
Amazon. In contrast to the first task, appearance of these 
covers is more diverse (see Figure 2 middle). This makes it 
possible to analyse fixation behaviour when both structure 
and colour information could be used by participants to find 
the target. 

Finally, for the third task, we use a set of 78 mugshots
from a public database of suspects. In contrast to the other
tasks, we transformed the mugshots to grey-scale so that
they did not contain any colour information (see Figure 2
bottom). This allows analysis of fixation behaviour
when colour information is not available at all. We found
faces to be particularly interesting given the relevance of
searching for faces in many practical applications.

We place images on a grid in order to form collages that
we show to the participants. Each collage is a random
permutation of the available set of images on the grid. The
search targets are a subset of the images in the collages. We opted
for an independent measures design to reduce fatigue (the
current recording already took 30 minutes of concentrated
search to complete) and learning effects, both of which may have
influenced fixation behaviour.

3.1. Participants, Apparatus, and Procedure 

We recorded fixation data of 18 participants (nine male)
with different nationalities and aged between 18 and 30
years. The eyesight of nine participants was impaired but
corrected with contact lenses or glasses. To record gaze
data we used a stationary Tobii TX300 eye tracker that provides
binocular gaze data at a sampling frequency of 300 Hz.
Parameters for fixation detection were left at their defaults:
fixation duration was set to 60 ms while the maximum time
between fixations was set to 75 ms. The stimuli were shown
on a 30 inch screen with a resolution of 2560x1600 pixels.

Participants were randomly assigned to search for targets 
for one of the three stimulus types. We first calibrated the 
eye tracker using a standard 9-point calibration, followed 
by a validation of eye tracker accuracy. After calibration, 
participants were shown the first out of five search targets. 
Participants had a maximum of 10 seconds to memorise the 
image and 20 seconds to subsequently find the image in the 
collage. Collages were displayed full screen and consisted 
of a fixed set of randomly ordered images on a grid. The 
target image always appeared only once in the collage at a 
random location. 

To determine more easily which images participants fixated
on, all images were placed on a grey background and
had a margin to neighbouring images of on average 18 pixels.
As soon as participants found the target image they
pressed a key. Afterwards they were asked whether they
had found the target and how difficult the search had been.
This procedure was repeated twenty times for five different
targets, resulting in a total of 100 search tasks. To minimise
lingering on the search target, participants were put under time
pressure and had to find the target and press a confirmation
button as quickly as possible. This resulted in lingering of
2.45% for Amazon (O’Reilly: 1.2%, mugshots: 0.35%).

4. Method 

In this work we are interested in search tasks in which
the fixation patterns are modulated by the search target.
Previous work focused on predicting a fixed set of targets for
which fixation data was provided at training time. We call
this the closed-world setting. In contrast, our method enables
prediction of new search targets, i.e. those for which
no fixation data is available for training. We refer to this as the
open-world setting. In the following, we first provide a
problem formulation for the previously investigated closed-world
setting. Afterwards we present a new problem formulation
for search target prediction in an open-world setting
(see Figure 1).

4.1. Search Target Prediction

Given a query image (search target) $Q \in \mathcal{Q}$ and a stimulus
collage $C \in \mathcal{C}$, during a search task participants
$P \in \mathcal{P}$ perform fixations
$F(C,Q,P) = \{(x_i, y_i, a_i),\ i = 1, \ldots, N\}$, where each fixation
is a triplet of positions $x_i, y_i$ in screen coordinates and
appearance $a_i$ at the fixated location. To recognise search targets
we aim to find a mapping from fixations to query images:

$$F(C,Q,P) \mapsto Q \in \mathcal{Q} \qquad (1)$$

We use a bag of visual words featurisation $\phi$ of the fixations.
We interpret fixations as key points around which
we extract local image patches. These are clustered into a
visual vocabulary $V$ and accumulated in a count histogram.
This leads to a fixed-length vector representation of dimension
$|V|$ commonly known as a bag of words. Therefore,
our recognition problem can more specifically be expressed
as:

$$\phi(F(C,Q,P), V) \mapsto Q \in \mathcal{Q} \qquad (2)$$

4.2. Closed-World Setting

We now formulate the previously investigated case of
the closed-world setting where all test queries (search targets)
$Q \in \mathcal{Q}_{test}$ are part of our training set,
$\mathcal{Q}_{test} = \mathcal{Q}_{train}$, and, in particular, we
assume that we observe fixations
$F_{train} = \{F(C,Q,P) \mid \forall Q \in \mathcal{Q}_{train}\}$.
The task is to predict the search target while the query and/or
participant changes (see Figure 1).

$$\phi(F(C,Q,P), V) \mapsto Q \in \mathcal{Q}_{train} \qquad (3)$$

We use a one-vs-all multi-class SVM classifier $H_i$ and select
the query image with the largest margin:

$$Q_i = \operatorname*{arg\,max}_{i = 1, \ldots, |\mathcal{Q}_{test}|} H_i(\phi(F_{test}, V)) \qquad (4)$$

4.3. Open-World Setting

In contrast, in our new open-world setting, we no longer
assume that we have fixation data to train for these targets.
Therefore $\mathcal{Q}_{test} \cap \mathcal{Q}_{train} = \emptyset$.
The main challenge that arises
from this setting is to develop a learning mechanism that
can predict over a set of classes that is unknown at training
time (see Figure 1).

Search Target Prediction. To circumvent the problem of
training for a fixed number of search targets, we propose to
encode the search target into the feature vector, rather than
considering it a class that is to be recognised. This leads
to a formulation where we learn compatibilities between observed
fixations and query images:

$$(F(C,Q_i,P), Q_j) \mapsto Y \in \{0,1\} \qquad (5)$$

Training is performed by generating data points of all
pairs of $Q_i$ and $Q_j$ in $\mathcal{Q}_{train}$ and assigning a
compatibility label $Y$ accordingly:

$$Y = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (6)$$

The intuition behind this approach is that the compatibility
predictor learns about similarities in fixations and search
targets that can also be applied to new fixations and search
targets.

Similar to the closed-world setting, we propose a featurisation
of the fixations and query images. Although we can
use the same fixation representation as before, we do not
have fixations for the query images. Therefore, we introduce
a sampling strategy $S$ which still allows us to generate
a bag-of-words representation for a given query. In this
work we propose to use sampling from the saliency map
as a sampling strategy. We stack the representation of the
fixation and the query images. This leads to the following
learning problem:

$$\begin{pmatrix} \phi(F(C,Q_i,P), V) \\ \phi(S(Q_j)) \end{pmatrix} \mapsto Y \in \{0,1\} \qquad (7)$$

We learn a model for the problem by training a single binary
SVM classifier $B$ according to the labelling as described
above. At test time we find the query image describing the
search target by

$$Q = \operatorname*{arg\,max}_{Q_j \in \mathcal{Q}_{test}} B\begin{pmatrix} \phi(F_{test}, V) \\ \phi(S(Q_j)) \end{pmatrix} \qquad (8)$$

Figure 3: Proposed approach of sampling eight additional
image patches around each fixation location to compensate
for eye tracker inaccuracy. The size of orange dots corresponds
to the fixation’s duration.

Note that while we do not require fixation data for the
query images that we want to predict at test time, we still
search over a finite set of query images $\mathcal{Q}_{test}$.
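The pairing and stacking steps of the open-world formulation can be illustrated with a minimal sketch. It assumes that the bag-of-words histograms for fixations and for sampled query patches have already been computed; the function names, the dictionary layout, and `toy_score` (a stand-in for the decision value of the trained binary SVM) are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def make_compatibility_pairs(
    fixation_feats: Dict[int, List[Vector]],
    query_feats: Dict[int, Vector],
) -> Tuple[List[Vector], List[int]]:
    """Stack every fixation histogram with every query histogram.

    fixation_feats maps target index i to one histogram per search trial;
    query_feats maps candidate index j to the histogram of sampled patches.
    """
    X, y = [], []
    for i, trials in fixation_feats.items():
        for phi_f in trials:
            for j, phi_q in query_feats.items():
                X.append(list(phi_f) + list(phi_q))  # stacked vector
                y.append(1 if i == j else 0)          # compatibility label
    return X, y


def predict_target(
    phi_f_test: Vector,
    candidates: Dict[int, Vector],
    score: Callable[[Vector], float],
) -> int:
    """Return the candidate index whose stacked vector scores highest."""
    return max(
        candidates,
        key=lambda j: score(list(phi_f_test) + list(candidates[j])),
    )


def toy_score(stacked: Vector) -> float:
    """Hypothetical scorer: negative squared distance between both halves."""
    d = len(stacked) // 2
    return -sum((a - b) ** 2 for a, b in zip(stacked[:d], stacked[d:]))
```

In an actual experiment `score` would be replaced by the decision function of the trained classifier; the candidates passed to `predict_target` may be query images that never occurred during training, which is the point of the formulation.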


5. Experiments 

Our dataset contains fixation data from six participants
for each search task. To analyse the first and second search
task (O’Reilly and Amazon book covers) we used RGB values
extracted from a patch (window) of size $m \times m$ around
each fixation as input to the bag-of-words model. For the
third search task (mugshots) we calculated a histogram of
local binary patterns from each fixation patch. To compensate
for inaccuracies of the eye tracker we extracted eight
additional points with non-overlapping patches around each
fixation (see Figure 3). Additionally, whenever an image
patch around a fixation had overlap with two images in the
collage, pixel values in the area of the overlap were set to
128.
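As an illustration of the patch sampling described above, the following sketch places the eight additional points on a 3x3 grid around the fixation. The grid spacing equal to the patch size m is an assumption (the text only states that the patches do not overlap), and both function names are hypothetical. Filling pixels outside the image with 128 is likewise an illustrative choice and is not the same as the overlap handling described in the text.

```python
from typing import List, Tuple


def patch_centres(x: int, y: int, m: int) -> List[Tuple[int, int]]:
    """Fixation (x, y) plus eight neighbours on a 3x3 grid with spacing m.

    Spacing the centres by m keeps the m x m patches from overlapping
    (assumed layout, see lead-in).
    """
    return [(x + dx * m, y + dy * m) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def extract_patch(
    image: List[List[int]], cx: int, cy: int, m: int, fill: int = 128
) -> List[List[int]]:
    """Cut an m x m window centred on (cx, cy) from a 2D intensity image.

    Pixels that fall outside the image are set to `fill`.
    """
    half = m // 2
    height, width = len(image), len(image[0])
    patch = []
    for yy in range(cy - half, cy - half + m):
        row = []
        for xx in range(cx - half, cx - half + m):
            inside = 0 <= yy < height and 0 <= xx < width
            row.append(image[yy][xx] if inside else fill)
        patch.append(row)
    return patch
```

Each fixation then contributes nine patches instead of one to the bag-of-words histogram.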


5.1. Closed-World Evaluation 

In our closed-world evaluation we distinguish between
within-participant and cross-participant predictions. In the
“within participant” condition we predict the search target
for each participant individually using their own training
data. In contrast, for the “cross participant” condition, we
predict the search target across participants. The “cross
participant” condition is more challenging as the algorithm has
to generalise across users. Chance level is defined based on
the number of search targets or classes our algorithm is going
to predict. Participants were asked to search for five different
targets in each experiment (chance level 1/5 = 20%).

Figure 4: Closed-world evaluation results showing mean
and standard deviation of within-participant prediction accuracy
for Amazon book covers, O’Reilly book covers, and
mugshots. Mean performance is indicated with black lines,
and the chance level is indicated with the dashed line.


Within-Participant Prediction 

Participants looked for each search target 20 times. To
train our classifier we used the data from 10 trials and the
remaining 10 trials were used for testing. We fixed the
patch (window) size to 41 x 41 and optimised k (vocabulary
size) for each participant. Figure 4 summarises the
within-participant prediction accuracies for the three search tasks.
Accuracies were well above chance for all participants for
the Amazon book covers (average accuracy 75%) and the
O’Reilly book covers (average accuracy 69%). Accuracies
were lower for mugshots but still above chance level (average
accuracy 30%, chance level 20%).
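To make the role of the vocabulary size k concrete, the sketch below turns a set of patch descriptors into a count histogram of length k. It assumes the vocabulary is already available as a list of k cluster centres; the clustering step itself is not shown, and the function names and the squared Euclidean distance are illustrative choices rather than details from the paper.

```python
from typing import List, Sequence


def nearest_word(
    descriptor: Sequence[float], vocabulary: Sequence[Sequence[float]]
) -> int:
    """Index of the vocabulary entry closest to the descriptor."""
    def sq_dist(word: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))

    return min(range(len(vocabulary)), key=lambda idx: sq_dist(vocabulary[idx]))


def bag_of_words(
    descriptors: Sequence[Sequence[float]],
    vocabulary: Sequence[Sequence[float]],
) -> List[int]:
    """Count histogram of length k = len(vocabulary) over all descriptors."""
    hist = [0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1
    return hist
```

The histogram length does not depend on the number of fixations in a trial, which is what allows trials of different duration to be fed to the same classifier.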


Cross-Participant Prediction 

We investigated whether search targets could be predicted
within and across participants. In the across-participants
case, we trained a one-vs-all multi-class SVM classifier using
3-fold cross-validation. We trained our model with data
from three participants to map the observer-fixated patch to
the target image. The resulting classifier was then tested
on data from the remaining three participants. Prior to our
experiments, we ran a control experiment where we uniformly
sampled from 75% of the salient part of the collages.
We trained the classifier with these randomly sampled
fixations and confirmed that performance was around
the chance level of 20% and therefore any improvement can
indeed be attributed to information contained in the fixation
patterns.

Figure 5: Closed-world evaluation results showing mean
and standard deviation of cross-participant prediction accuracy
for Amazon book covers (top), O’Reilly book covers
(middle), and mugshots (bottom). Results are shown
with (straight lines) and without (dashed lines) using the
proposed sampling approach around fixation locations. The
chance level is indicated with the dashed line.

Figure 5 summarises the cross-participant prediction accuracies
for Amazon book covers, O’Reilly book covers,
and mugshots for different window sizes and size of vocabulary
k, as well as results with (straight lines) and without
(dashed lines) using the proposed sampling approach
around fixation locations. The optimum k represents the
upper bound and corresponds to always choosing the value
of k that optimises accuracy, while the minimum k correspondingly
represents the lower bound. Average k refers to
the practically most realistic setting in which we fix k = 60.
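The three summary values described above can be read off a table of accuracies per vocabulary size, as in this small sketch. The function name and the dictionary layout are hypothetical, and any numbers passed to it are placeholders rather than results from the paper.

```python
from typing import Dict


def summarise_over_k(acc_by_k: Dict[int, float], fixed_k: int = 60) -> Dict[str, float]:
    """Upper bound, lower bound, and fixed-k accuracy over vocabulary sizes."""
    return {
        "optimum_k": max(acc_by_k.values()),   # best k chosen in hindsight
        "minimum_k": min(acc_by_k.values()),   # worst k chosen in hindsight
        "average_k": acc_by_k[fixed_k],        # realistic setting, k fixed
    }
```

Only the fixed-k value reflects what a deployed system could achieve, since the optimum requires knowing the test accuracy for every k in advance.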


Performance for Amazon book covers was best, followed
by O’Reilly book covers and mugshots. Accuracies were
between 61% ± 2% and 78% ± 2% for average k for Amazon
and O’Reilly book covers but only around chance level
for mugshots.

5.2. Open-World Evaluation 

In the open-world evaluation the challenge is to predict
the search target based on the similarity between fixations
$F(C,Q)$ and query image $S(Q)$. In the absence of fixations
for query images $Q$ we uniformly sample from the GBVS
saliency map [18]. We chose the number of samples to be of
the same order as the number of fixations on the collages.
For the within-participant evaluation we used the data from
three search targets of each participant to train a binary
SVM with RBF kernel. The data from the remaining two
search targets was used at test time. The average performance
over all participants in each group was 70.33% for Amazon,
59.66% for O’Reilly, and 50.83% for mugshots.
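One possible reading of the sampling strategy is sketched below: positions are drawn uniformly from the most salient pixels of a precomputed saliency map. Computing the GBVS map itself is not shown. The function name, the `keep_fraction` parameter and its default, and the fixed random seed are assumptions for illustration only; the paper does not specify these details.

```python
import random
from typing import List, Sequence, Tuple


def sample_salient_points(
    saliency: Sequence[Sequence[float]],
    n_samples: int,
    keep_fraction: float = 0.5,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Draw n_samples (x, y) positions uniformly from the most salient pixels.

    The saliency map is a 2D grid of values; only the top `keep_fraction`
    of pixels (by saliency) are eligible, and at least one pixel is kept.
    """
    pixels = [
        (value, x, y)
        for y, row in enumerate(saliency)
        for x, value in enumerate(row)
    ]
    pixels.sort(key=lambda p: p[0], reverse=True)
    n_keep = max(1, int(len(pixels) * keep_fraction))
    salient = [(x, y) for _, x, y in pixels[:n_keep]]
    rng = random.Random(seed)
    return [rng.choice(salient) for _ in range(n_samples)]
```

The sampled positions play the role of fixations on the query image, so the same patch extraction and bag-of-words steps can be applied to them.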

Because the task is more challenging in the cross-participant
evaluation, we report results for this task in more
detail. As described previously, we train a binary SVM
with RBF kernel from data of three participants to learn the
similarity between the observer-fixated patch when looking
for three of the search targets and the corresponding target
images. Our positive class contains data coming from the
concatenation of $\phi(F(C,Q_i,P), V)$ and $\phi(S(Q_j))$ when
$i = j$. At test time, we then test on the data of the remaining
three participants looking for two other search targets
that did not appear in the training set and the corresponding
search targets. The chance level is 1/2 = 50% as we have
a target vs. non-target decision.
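The cross-participant open-world protocol can be summarised as a split over (participant, target) pairs in which neither participants nor targets are shared between training and test. The sketch below assumes six participants and five targets per task as stated in the text; which participants and targets end up in which half is not specified, so taking the first ones for training is an arbitrary choice, and the function name is hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def open_world_split(
    participants: Sequence[str],
    targets: Sequence[str],
    n_train_participants: int = 3,
    n_train_targets: int = 3,
) -> Tuple[List[Pair], List[Pair]]:
    """Return (train, test) lists of (participant, target) pairs.

    Training uses the first participants and targets; testing uses the
    remaining participants searching for the remaining targets.
    """
    train_p = participants[:n_train_participants]
    test_p = participants[n_train_participants:]
    train_q = targets[:n_train_targets]
    test_q = targets[n_train_targets:]
    return list(product(train_p, train_q)), list(product(test_p, test_q))
```

With six participants and five targets this gives nine training pairs and six test pairs, and every test target is unseen at training time.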

Figure 6 summarises the cross-participant prediction accuracies
for Amazon book covers, O’Reilly book covers,
and mugshots for different window sizes and size of vocabulary
k, as well as results with (straight lines) and
without (dashed lines) using the proposed sampling approach
around fixation locations. With average k the
model achieves an accuracy of 75% for Amazon book
covers, which is significantly higher than chance at 50%.
For O’Reilly book covers accuracy reaches 55% and for
mugshots we reach 56%. Similar to our closed-world setting,
accuracy is generally better when using the proposed
sampling approach.

6. Discussion 

In this work we studied the problem of predicting the
search target during visual search from human fixations.
Figure 4 shows that we can predict the search target significantly
above chance level for the within-participant case
for the Amazon and O’Reilly book cover search tasks, with
accuracies ranging from 50% to 78%. Figure 5 shows similar
results for the cross-participant case. These findings are
in line with previous works on search target prediction in
closed-world settings [41, 2]. Our findings extend these previous
works in that we study synthesised collages of natural
images and in that our method has to handle a larger number
of distractors, higher similarities between search image
and distractors, and a larger number of potential search targets.
Instead of a large number of features, we rely only on
colour information as well as local binary pattern features.

Figure 6: Open-world evaluation results showing mean
and standard deviation of cross-participant prediction accuracy
for Amazon book covers (top), O’Reilly book covers
(middle), and mugshots (bottom). Results are shown
with (straight lines) and without (dashed lines) using the
proposed sampling approach around fixation locations. The
chance level is indicated with the dashed line.

Figure 7: Average number of fixations per trial performed
by each participant during the different search tasks.

We extended these evaluations with a novel open-world
evaluation setting in which we no longer assume that we
have fixation data to train for these targets. To learn under
such a regime we proposed a new formulation where we
learn compatibilities between observed fixations and query
images. As can be seen from Figure 6, despite the much
more challenging setting, using this formulation we can still
predict the search target significantly above chance level for
the Amazon book cover search task, and just about chance
level for the other two search tasks for selected values of
k. These results are meaningful as they underline the significant
information content available in human fixation patterns
during visual search, even in a challenging open-world
setting. The proposed method of sampling eight additional
image patches around each fixation to compensate for eye
tracker inaccuracies proved to be necessary and effective
for both evaluation settings and increased performance in
the closed-world setting by up to 20%, and by up to 5% in
the open-world setting.

These results also support our initial hypothesis that the
search task, i.e. in particular the similarity in appearance
between target and search set and thus the difficulty, has a
significant impact on both fixation behaviour and prediction
performance. Figures 5 and 6 show that we achieved the
best performance for the Amazon book covers, for which
appearance is very diverse and participants can rely on both
structure and colour information. The O’Reilly book covers,
for which the cover structure was similar and colour was the
most discriminative feature, achieved the second best performance.
In contrast, the worst performance was achieved for
the greyscale mugshots that had highly similar structure and
did not contain any colour information. These findings are
in line with previous works in human vision that found that
colour is a particularly important cue for guiding search to
targets and target-similar objects [29, 21].

Figure 8: Sample scanpaths of P8: Targeted search behaviour
with a low number of fixations (top), and skimming
behaviour with a high number of fixations (bottom). Size of
the orange dots corresponds to fixation durations.

Analysing the visual strategies that participants used provides
additional interesting (yet anecdotal) insights. As the
difficulty of the search task increased, participants tended
to start skimming the whole collage rather than doing targeted
search for specific visual features (see Figure 8 for
an example). This tendency was the strongest for the most
difficult search task, the mugshots, for which the vast majority
of participants assumed a skimming behaviour. Additionally,
as can be seen from Figure 9, our system achieved
higher accuracy in search target prediction for participants
who followed a specific search strategy than for those who
skimmed most of the time. Well-performing participants
also required fewer fixations to find the target (see Figure 7).
Both findings are in line with previous works that describe
eye movement control, i.e. the planning of where to fixate
next, as an information maximisation problem [9, 35].
While participants unconsciously maximised the information
gain by fixating appropriately during search, in some
sense, they also maximised the information available for our
learning method, resulting in higher prediction accuracy.

Figure 9: Difference in accuracies of participants who have
a strategic search pattern vs participants that mainly skim
the collage to find the search image.

7. Conclusion 

In this paper we demonstrated how to predict the search
target during visual search from human fixations in an open-world
setting. This setting is fundamentally different from
settings investigated in prior work, as we no longer assume
that we have fixation data to train for these targets. To
address this challenge, we presented a new approach that
is based on learning compatibilities between fixations and
potential targets. We showed that this formulation is effective
for search target prediction from human fixations.
These findings open up several promising research directions
and application areas, in particular gaze-supported image
and media retrieval as well as human-computer interaction.
Adding visual behaviour features and temporal information
to improve performance is a promising extension
that we are planning to explore in future work.